146 research outputs found

    A PVM Based Library for Sparse Matrix Factorizations

    This is a post-peer-review, pre-copyedit version of an article published in Lecture Notes in Computer Science. The final authenticated version is available online at: https://doi.org/10.1007/BFb0056589
    [Abstract] We present 3LM, a C Linked List Management Library for parallel sparse factorizations in a PVM environment. The library takes into account fill-in, an important drawback of sparse computations. It is restricted to a mesh topology and is based on the SPMD paradigm. Our goal is to facilitate programming in such environments by means of a set of list- and vector-oriented operations. The result is pseudo-sequential code in which the interprocessor communications and the sparse data structures are hidden from the programmer.
    Ministerio de Educación; CICYT TIC96-1125-C03. Xunta de Galicia; XUGA20605B9.

    Characterization of message-passing overhead on the AP3000 multicomputer

    This is a post-peer-review, pre-copyedit version. The final authenticated version is available online at: http://dx.doi.org/10.1109/ICPP.2001.952077
    [Abstract] The performance of the communication primitives of parallel computers is critical for overall system performance. Characterizing the communication overhead is essential for estimating the global performance of parallel applications and for detecting possible bottlenecks. In this paper, we evaluate, model and compare the performance of the message-passing libraries provided by the Fujitsu AP3000 multicomputer: MPI/AP, PVM/AP and APlib. Our aim is to fairly characterize the communication primitives using general models and performance metrics.
    Ministerio de Ciencia y Tecnología; 1FD97-0118-C02.
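As a rough illustration of what characterizing a communication primitive with a general model can look like, the sketch below fits the common linear model T(n) = t0 + n / B (startup latency plus message size over bandwidth) to timing data by least squares. The abstract does not say which models the authors use, so the linear model, the function name `fit_linear_latency_model`, and the synthetic timings are all assumptions made for this example.

```python
def fit_linear_latency_model(sizes_bytes, times_us):
    """Least-squares fit of T(n) = t0 + n / bandwidth.

    Returns (t0, bandwidth): startup latency in microseconds and
    asymptotic bandwidth in bytes per microsecond.
    """
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_us) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_bytes, times_us))
    slope = sxy / sxx              # microseconds per byte
    t0 = mean_y - slope * mean_x   # intercept: time for a zero-byte message
    return t0, 1.0 / slope


# Synthetic, exactly linear timings (hypothetical values, not measurements):
# 50 us startup latency and 100 bytes/us bandwidth.
sizes = [0, 1024, 4096, 16384, 65536]
times = [50.0 + s / 100.0 for s in sizes]
t0, bandwidth = fit_linear_latency_model(sizes, times)
```

With real measurements the fit would be noisy, and a single straight line often fails to capture protocol switches between short and long messages, so a piecewise model may be a better choice in practice.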

    Sparse Householder QR factorization on a mesh

    This is a post-peer-review, pre-copyedit version of an article published in Proceedings of 4th Euromicro Workshop on Parallel and Distributed Processing. The final authenticated version is available online at: http://dx.doi.org/10.1109/EMPDP.1996.500566.
    [Abstract] We analyze the parallelization of QR factorization by means of Householder transformations. This parallelization is carried out on a machine with a mesh topology (a 2-D torus, to be more precise). We use a cyclic distribution of the elements of the sparse matrix M we want to decompose over the processors. Each processor represents the nonzero elements of its part of the matrix by a one-dimensional doubly linked list data structure. Then, we describe the different procedures that constitute the parallel algorithm. As an application of QR factorization, we concentrate on the least squares problem, and finally we present an evaluation of the efficiency of this algorithm for a set of test matrices from the Harwell-Boeing sparse matrix collection.
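The cyclic distribution mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming the usual cyclic (scatter) mapping in which element (i, j) goes to mesh processor (i mod P, j mod Q). The abstract does not give the exact mapping, and plain Python lists stand in for the doubly linked lists the authors describe; `owner` and `distribute` are hypothetical names.

```python
from collections import defaultdict


def owner(i, j, mesh_rows, mesh_cols):
    """Mesh coordinates of the processor that owns element (i, j)."""
    return (i % mesh_rows, j % mesh_cols)


def distribute(nonzeros, mesh_rows, mesh_cols):
    """Group (row, col, value) triples by owning processor.

    Each local list is ordered by column, then row, which is the order
    a per-processor list of nonzeros would typically be traversed in.
    """
    local = defaultdict(list)
    for i, j, v in sorted(nonzeros, key=lambda e: (e[1], e[0])):
        local[owner(i, j, mesh_rows, mesh_cols)].append((i, j, v))
    return dict(local)


# A small made-up sparse matrix, distributed over a 2 x 2 mesh.
nonzeros = [(0, 0, 4.0), (1, 2, -1.0), (2, 1, 3.0), (3, 3, 2.0), (2, 0, 5.0)]
parts = distribute(nonzeros, 2, 2)
```

A cyclic mapping tends to spread fill-in across processors as the factorization proceeds, which is one common reason to prefer it over a block mapping for sparse problems.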

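    Teaching innovation in the EHEA geared toward professional practice through project-based learning (original title: Innovación docente en el EEES de cara a la práctica profesional a través del aprendizaje basado en proyectos)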

    This article describes our experience teaching Computer Architecture and Engineering in the Master's in Computer Science (Máster en Informática) at the Universidade da Coruña, where two circumstances coincided: a newly introduced European Higher Education Area (EHEA) degree and a small number of students. The professional orientation of the master's degree motivated us to explore teaching innovation geared toward professional practice, mainly through project-based learning methodologies combined with the following actions: (1) replacing theoretical lectures with academically supervised assignments; (2) holding professional seminars; (3) using role-playing techniques; and (4) developing communication skills. Our overall assessment is that this methodology and its associated actions have proved extremely positive for the teaching of the subject.
    Peer Reviewed

    Affine Modeling of Program Traces

    This is a post-peer-review, pre-copyedit version of an article published in IEEE Transactions on Computers. The final authenticated version is available online at: http://dx.doi.org/10.1109/TC.2018.2853747
    [Abstract] A formal, high-level representation of programs is typically needed for static and dynamic analyses performed by compilers. However, the source code of target applications is not always available in an analyzable form, e.g., to protect intellectual property. To reason about such applications, it becomes necessary to build models from observations of their execution. This paper presents an algebraic approach which, taking as input the trace of memory addresses accessed by a single memory reference, synthesizes an affine loop with a single perfectly nested statement that generates the original trace. This approach is extended to support the synthesis of unions of affine loops, useful for minimally modeling traces generated by automatic transformations of polyhedral programs, such as tiling. The resulting system is capable of processing hundreds of gigabytes of trace data in minutes, minimally reconstructing 100 percent of the static control parts in PolyBench/C applications and 99.9 percent in the Pluto-tiled versions of these benchmarks.
    Ministerio de Economía y Competitividad; TIN2016-75845-P. National Science Foundation (United States); 1626251. National Science Foundation (United States); 1409095. National Science Foundation (United States); 1439057. National Science Foundation (United States); 1213052. National Science Foundation (United States); 1439021. National Science Foundation (United States); 162912
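To make the idea of synthesizing an affine loop from an address trace concrete, here is a deliberately simplified sketch. It is not the algebraic method of the paper: it only handles a rectangular loop nest of depth at most two, guesses the strides from the first differences in the trace, and accepts the model only if regenerating the trace reproduces the input exactly. The function names and the example addresses are hypothetical.

```python
def generate(model):
    """Replay the addresses of: for i in [0, n_out): for j in [0, n_in)."""
    base, s_in, n_in, s_out, n_out = model
    return [base + i * s_out + j * s_in
            for i in range(n_out) for j in range(n_in)]


def fit_two_level_affine(trace):
    """Return (base, inner_stride, inner_count, outer_stride, outer_count),
    or None if no rectangular two-level affine loop reproduces the trace."""
    if len(trace) < 2:
        return None
    base = trace[0]
    inner_stride = trace[1] - trace[0]
    # The inner trip count is the position of the first stride change.
    inner_count = len(trace)
    for k in range(1, len(trace)):
        if trace[k] - trace[k - 1] != inner_stride:
            inner_count = k
            break
    if len(trace) % inner_count != 0:
        return None
    outer_count = len(trace) // inner_count
    outer_stride = trace[inner_count] - base if outer_count > 1 else 0
    model = (base, inner_stride, inner_count, outer_stride, outer_count)
    # Accept the candidate only if it regenerates the trace exactly.
    return model if generate(model) == list(trace) else None


# Made-up trace: a 3 x 4 sweep over 4-byte elements with a 32-byte row pitch.
trace = [0x1000 + 32 * i + 4 * j for i in range(3) for j in range(4)]
model = fit_two_level_affine(trace)
```

The regenerate-and-compare step is what makes the result trustworthy in this toy version: a candidate that does not replay the trace exactly is rejected rather than returned as an approximation.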

    Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy

    [Abstract] On-chip power consumption is one of the fundamental challenges of current technology scaling. Cache memories consume a sizable part of this power, particularly due to leakage energy. STT-RAM is one of several new memory technologies that have been proposed in order to improve power while preserving performance. It features high density and low leakage, but at the expense of write energy and performance. This article explores the use of STT-RAM-based scratchpad memories that trade nonvolatility for faster and less energetically expensive accesses, making them feasible for on-chip implementation in embedded systems. A novel multiretention scratchpad partitioning is proposed, featuring multiple storage spaces with different retention, energy, and performance characteristics. A customized compiler-based allocation algorithm suitable for use with such a scratchpad organization is described. Our experiments indicate that a multiretention STT-RAM scratchpad can provide energy savings of 53% with respect to an iso-area, hardware-managed SRAM cache.

    XARK: an extensible framework for automatic recognition of computational kernels

    This is a post-peer-review, pre-copyedit version of an article published in ACM Transactions on Programming Languages and Systems. The final authenticated version is available online at: http://dx.doi.org/10.1145/1391956.1391959
    [Abstract] The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for automatic recognition of computational kernels such as inductions, reductions and array recurrences was an intensive research area in compiler technology during the 1990s. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. The internals of the recognition algorithms are also presented. The design of the algorithms makes it possible to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework to run other compiler techniques on demand. Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.
    Ministerio de Educación y Ciencia; TIN2004-07797-C02. Ministerio de Educación y Ciencia; TIN2007-67537-C03. Xunta de Galicia; PGIDIT05PXIC10504PN. Xunta de Galicia; PGIDIT06PXIB105228P.

    Compiler support for parallel code generation through kernel recognition

    [Abstract] Summary form only given. The automatic parallelization of loops that contain complex computations is still a challenge for current parallelizing compilers. The main limitations are related to the analysis of expressions that contain subscripted subscripts, and the analysis of conditional statements that introduce complex control flows at run-time. We use the term complex loop to designate loops with such characteristics. We describe the parallelization of sequential complex loop nests using a generic compiler framework (proposed in an earlier paper [Arenaz et al., ICS'2003]) that accomplishes kernel recognition through the analysis of the gated single assignment program representation. Specifically, we focus on an extension of this framework that enables its use as a powerful tool for gathering source code information that is relevant for the parallelization of each computational kernel. A set of example codes is analyzed in detail to illustrate the potential of our approach. Experimental results using a benchmark suite of complex loop nests are also presented.
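For readers unfamiliar with the term, the sketch below shows a small complex loop in the sense used in this abstract: the update `hist[idx[i]] += vals[i]` uses a subscripted subscript, and it sits under a condition that is only known at run time. The second function shows one generic way such a loop can be parallelized once it is recognized as an irregular reduction, namely array privatization with a final merge. This is an illustrative assumption, not the strategy described by the authors, and all names are hypothetical.

```python
def irregular_reduction_sequential(n_bins, idx, vals):
    """Sequential complex loop: conditional update through idx[i]."""
    hist = [0.0] * n_bins
    for i in range(len(idx)):
        if vals[i] > 0.0:              # control flow decided at run time
            hist[idx[i]] += vals[i]    # subscripted subscript
    return hist


def irregular_reduction_privatized(n_bins, idx, vals, n_chunks):
    """Same result using one private accumulator per chunk of iterations.

    The chunks are independent of each other, so each one could be
    handed to a separate worker; here they run one after another.
    """
    chunk = (len(idx) + n_chunks - 1) // n_chunks
    partials = []
    for c in range(n_chunks):
        private = [0.0] * n_bins
        for i in range(c * chunk, min((c + 1) * chunk, len(idx))):
            if vals[i] > 0.0:
                private[idx[i]] += vals[i]
        partials.append(private)
    # Merge step: combine the private copies bin by bin.
    return [sum(p[b] for p in partials) for b in range(n_bins)]


# Made-up input data.
idx = [0, 2, 2, 1, 0, 2]
vals = [1.0, -3.0, 2.0, 4.0, 5.0, 8.0]
sequential = irregular_reduction_sequential(3, idx, vals)
privatized = irregular_reduction_privatized(3, idx, vals, 2)
```

Without knowing that the loop is a reduction, a compiler must assume that two iterations may write the same `hist` entry and therefore keep the loop sequential, which is why recognizing the kernel matters.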

    BDWatchdog: real-time monitoring and profiling of Big Data applications and frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2017.12.068
    [Abstract] Current Big Data applications are characterized by a heavy use of system resources (e.g., CPU, disk) generally distributed across a cluster. To effectively improve their performance, there is a critical need for an accurate analysis of both Big Data workloads and frameworks. This means fully understanding how the system resources are being used in order to identify potential bottlenecks, ranging from resource to code bottlenecks. This paper presents BDWatchdog, a novel framework that allows real-time and scalable analysis of Big Data applications by combining time series for resource monitoring and flame graphs for code profiling, focusing on the processes that make up the workload rather than the underlying instances on which they are executed. This shift from traditional system-based monitoring to a process-based analysis is relevant to new paradigms such as software containers or serverless computing, where the focus is put on applications and not on instances. BDWatchdog has been evaluated on a Big Data cloud-based service deployed at the CESGA supercomputing center. The experimental results show that a process-based analysis allows for a more effective visualization and overall improves the understanding of Big Data workloads. BDWatchdog is publicly available at http://bdwatchdog.dec.udc.es.
    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P. Ministerio de Educación; FPU15/0338